In this case study we will use various supervised learning-based models to predict the stock price of Microsoft using correlated assets and its own historical data.
In this case study, we will be extending a case study model shown in:
In the seed code, other than the historical data of Microsoft, the independent variables used are the following potentially correlated assets:
In this framework, we will be extending this case study by doing the following:
# Load libraries
import numpy as np
import pandas as pd
import pandas_datareader.data as web # type: ignore
import yfinance as yf
from matplotlib import pyplot
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
# For Feature Engineering
import talib as ta
# For Feature Selection
from sklearn.feature_selection import SelectPercentile, f_regression, mutual_info_regression
from sklearn.decomposition import PCA
#Libraries for Deep Learning Models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.layers import LSTM
from scikeras.wrappers import KerasRegressor
#Libraries for Statistical Models
import statsmodels.api as sm
#Libraries for Saving the Model
from pickle import dump
from pickle import load
# Time series Models
from statsmodels.tsa.arima.model import ARIMA
from pmdarima import auto_arima
# Error Metrics
from sklearn.metrics import mean_squared_error
# Feature Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
#Plotting
from statsmodels.graphics.tsaplots import plot_acf
# Disable the warnings
import warnings
warnings.filterwarnings('ignore')
# Set global seed
np.random.seed(7)
Next, we extract the data required for our analysis using pandas-datareader (for the FRED series) and yfinance (for the stock prices).
stk_tickers = ['MSFT', 'IBM', 'GOOGL'] # Note: GOOGL has many NaN values
# stk_tickers = ['MSFT', 'IBM', 'AAPL']
ccy_tickers = ['DEXJPUS', 'DEXUSUK']
idx_tickers = ['SP500', 'DJIA', 'VIXCLS']
ccy_data = web.DataReader(ccy_tickers, 'fred')
idx_data = web.DataReader(idx_tickers, 'fred')
start = '2019-12-12'; end = '2024-12-06'
stk_data = yf.download(stk_tickers, start, end)
Next, we need a series to predict. We choose to predict using weekly returns. We approximate this by using 5 business day period returns.
return_period = 5
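The `diff(return_period).shift(-return_period)` pattern used below to build the target can be illustrated on a synthetic price series (a minimal, self-contained sketch; the series here is hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic price series growing by exactly 1% (in log terms) per step
prices = pd.Series(np.exp(0.01 * np.arange(20)))

return_period = 5
trailing = np.log(prices).diff(return_period)  # 5-period trailing log return
future = trailing.shift(-return_period)        # shift back so row t holds the return over (t, t+5]

print(round(future.iloc[0], 6))  # → 0.05
```

Row t of `future` is the forward-looking 5-period log return, which is exactly the quantity we want to predict; the last `return_period` rows are NaN because their future is unknown.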
We now define our Y series and our X series:
Y: MSFT future returns
X:
a. GOOGL 5-business-day returns
b. IBM 5-business-day returns
c. USD/JPY 5-business-day returns
d. GBP/USD 5-business-day returns
e. S&P 500 5-business-day returns
f. Dow Jones 5-business-day returns
g. MSFT 5-business-day returns
h. MSFT 15-business-day returns
i. MSFT 30-business-day returns
j. MSFT 60-business-day returns
We remove the MSFT past returns when we use the time series models.
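Dropping MSFT's own lagged-return columns for the time series models can be sketched as follows (a hypothetical snippet; the frame and values here are made up, but the column names match those assigned to X4 later in this notebook):

```python
import pandas as pd

# Hypothetical feature frame containing, among others, MSFT lag-return columns
X_example = pd.DataFrame({
    'GOOGL': [0.01, 0.02], 'IBM': [0.00, 0.01],
    'MSFT_DT': [0.01, 0.03], 'MSFT_3DT': [0.02, 0.05],
    'MSFT_6DT': [0.04, 0.06], 'MSFT_12DT': [0.08, 0.10],
})
msft_lag_cols = ['MSFT_DT', 'MSFT_3DT', 'MSFT_6DT', 'MSFT_12DT']
X_ts = X_example.drop(columns=msft_lag_cols)  # features for the time series models

print(list(X_ts.columns))  # → ['GOOGL', 'IBM']
```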
stk_data
| Price | Adj Close | Close | High | Low | Open | Volume | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ticker | GOOGL | IBM | MSFT | GOOGL | IBM | MSFT | GOOGL | IBM | MSFT | GOOGL | IBM | MSFT | GOOGL | IBM | MSFT | GOOGL | IBM | MSFT |
| Date | ||||||||||||||||||
| 2019-12-12 | 67.180969 | 102.522102 | 146.571854 | 67.424500 | 129.369019 | 153.240005 | 67.728996 | 129.694077 | 153.440002 | 66.910004 | 127.782028 | 151.020004 | 67.160500 | 127.820267 | 151.649994 | 29114000 | 5046009 | 24612100 |
| 2019-12-13 | 67.100258 | 101.681129 | 147.805664 | 67.343498 | 128.307846 | 154.529999 | 67.567497 | 129.541107 | 154.889999 | 67.083000 | 128.116638 | 152.830002 | 67.394997 | 128.824097 | 153.000000 | 33170000 | 2651610 | 23845400 |
| 2019-12-16 | 67.789261 | 101.620522 | 148.762161 | 68.035004 | 128.231354 | 155.529999 | 68.176498 | 129.493301 | 155.899994 | 67.530502 | 127.963669 | 154.820007 | 67.750000 | 129.005737 | 155.110001 | 28128000 | 3189463 | 24144200 |
| 2019-12-17 | 67.499809 | 101.688721 | 147.958694 | 67.744499 | 128.317398 | 154.690002 | 68.216499 | 128.766724 | 155.710007 | 67.538498 | 127.590820 | 154.449997 | 68.120499 | 128.374756 | 155.449997 | 32948000 | 3040931 | 25425600 |
| 2019-12-18 | 67.351341 | 101.832649 | 147.652649 | 67.595497 | 128.499039 | 154.369995 | 67.971497 | 129.063095 | 155.479996 | 67.523003 | 128.250473 | 154.179993 | 67.849998 | 128.632889 | 154.300003 | 23330000 | 3244483 | 24129200 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-11-29 | 168.756592 | 227.410004 | 423.459991 | 168.949997 | 227.410004 | 423.459991 | 169.429993 | 230.360001 | 424.880005 | 167.160004 | 227.190002 | 417.799988 | 168.500000 | 227.750000 | 420.089996 | 14257200 | 2640300 | 16271900 |
| 2024-12-02 | 171.293686 | 227.389999 | 430.980011 | 171.490005 | 227.389999 | 430.980011 | 172.080002 | 228.380005 | 433.000000 | 168.570007 | 225.509995 | 421.309998 | 168.770004 | 227.500000 | 421.570007 | 23789100 | 2655900 | 20207200 |
| 2024-12-03 | 171.143845 | 229.000000 | 431.200012 | 171.339996 | 229.000000 | 431.200012 | 172.679993 | 229.110001 | 432.470001 | 170.850006 | 226.669998 | 427.739990 | 171.490005 | 227.240005 | 429.839996 | 22248700 | 3163800 | 18302000 |
| 2024-12-04 | 174.170380 | 233.490005 | 437.420013 | 174.369995 | 233.490005 | 437.420013 | 174.910004 | 233.740005 | 439.670013 | 171.059998 | 229.350006 | 432.630005 | 171.149994 | 230.000000 | 433.029999 | 31615100 | 4104200 | 26009400 |
| 2024-12-05 | 172.442368 | 234.750000 | 442.619995 | 172.639999 | 234.750000 | 442.619995 | 176.059998 | 236.520004 | 444.660004 | 172.330002 | 233.460007 | 436.170013 | 175.360001 | 233.550003 | 437.920013 | 21356200 | 4791100 | 21697800 |
1254 rows × 18 columns
Y = np.log(stk_data.loc[:, ('Adj Close', 'MSFT')]).diff(return_period).shift(-return_period)
Y.name = Y.name[-1]+'_pred'
X1 = np.log(stk_data.loc[:, ('Adj Close', ('GOOGL', 'IBM'))]).diff(return_period)
X1.columns = X1.columns.droplevel()
X2 = np.log(ccy_data).diff(return_period)
X3 = np.log(idx_data).diff(return_period)
X4 = pd.concat([np.log(stk_data.loc[:, ('Adj Close', 'MSFT')]).diff(i) for i in [return_period, return_period*3, return_period*6, return_period*12]], axis=1).dropna()
X4.columns = ['MSFT_DT', 'MSFT_3DT', 'MSFT_6DT', 'MSFT_12DT']
# Create multiple indicators first, then evaluate their predictive power using feature selection methods
# Generate Features (Technical Indicators)
features = pd.DataFrame(index=stk_data.index) # Empty DataFrame to store features
# MSFT-specific indicators
# Relative Strength Index (RSI)
features['MSFT_RSI'] = ta.RSI(stk_data.loc[:, ('Adj Close', 'MSFT')], timeperiod=14)
# Moving Average Convergence Divergence (MACD)
features['MSFT_MACD'], features['MSFT_MACD_signal'], features['MSFT_MACD_hist'] = ta.MACD(
stk_data.loc[:, ('Adj Close', 'MSFT')], fastperiod=12, slowperiod=26, signalperiod=9)
# Bollinger Bands
features['MSFT_BB_upper'], features['MSFT_BB_middle'], features['MSFT_BB_lower'] = ta.BBANDS(
stk_data.loc[:, ('Adj Close', 'MSFT')], timeperiod=20, nbdevup=2, nbdevdn=2, matype=0)
# Average True Range (ATR)
features['MSFT_ATR'] = ta.ATR(
stk_data.loc[:, ('High', 'MSFT')], stk_data.loc[:, ('Low', 'MSFT')],
stk_data.loc[:, ('Adj Close', 'MSFT')], timeperiod=14)
# Stochastic Oscillator
features['MSFT_STOCH_K'], features['MSFT_STOCH_D'] = ta.STOCH(
stk_data.loc[:, ('High', 'MSFT')], stk_data.loc[:, ('Low', 'MSFT')], stk_data.loc[:, ('Adj Close', 'MSFT')],
fastk_period=14, slowk_period=3, slowk_matype=0, slowd_period=3, slowd_matype=0)
# On-Balance Volume (OBV)
features['MSFT_OBV'] = ta.OBV(stk_data.loc[:, ('Adj Close', 'MSFT')], stk_data.loc[:, ('Volume', 'MSFT')])
# Custom indicators
# Logarithmic returns
features['MSFT_Log_Return'] = np.log(stk_data.loc[:, ('Adj Close', 'MSFT')]).diff()
# Short and long moving averages
features['MSFT_MA_short'] = stk_data.loc[:, ('Adj Close', 'MSFT')].rolling(window=10).mean()
features['MSFT_MA_long'] = stk_data.loc[:, ('Adj Close', 'MSFT')].rolling(window=50).mean()
# Oscillator (difference between short and long moving averages)
features['MSFT_Oscillator'] = features['MSFT_MA_short'] - features['MSFT_MA_long']
Apply input processing to make the technical indicators more robust and informative. We introduce two processes to achieve this: IQR stabilization and a focus on tail information.
# IQR Stabilization
rolling_window = 25
for col in features.columns:
quantile_25 = features[col].rolling(window=rolling_window).quantile(0.25)
quantile_75 = features[col].rolling(window=rolling_window).quantile(0.75)
iqr = quantile_75 - quantile_25
features[col] = (features[col] - features[col].rolling(window=rolling_window).mean()) / iqr
# Focus on Tail Information
for col in features.columns:
lower_quantile = features[col].quantile(0.10)
upper_quantile = features[col].quantile(0.90)
features[f'{col}_LowerTail'] = np.where(features[col] < lower_quantile, features[col], 0)
features[f'{col}_UpperTail'] = np.where(features[col] > upper_quantile, features[col], 0)
# Concat with the original features
X = pd.concat([X1, X2, X3, X4, features], axis=1)
# Try different methods to handle the missing data
dataset = pd.concat([Y, X], axis=1).dropna().iloc[::return_period, :] # Drop rows with NaN values
# dataset = pd.concat([Y, X], axis=1).fillna(0).iloc[::return_period, :] # Fill NaN values with 0
# dataset = pd.concat([Y, X], axis=1).interpolate().iloc[::return_period, :] # Interpolate missing values
# dataset = pd.concat([Y, X], axis=1).ffill().bfill().iloc[::return_period, :] # Forward- then backward-fill
Y = dataset.loc[:, Y.name]
X = dataset.loc[:, X.columns]
# Check if the dataset has NA value
if all(dataset.isna().sum()==0):
print("No NA value contained in the dataset")
else:
    print("NA values are contained in the dataset. Go back and revise")
# Check the total samples of the dataset
print(f"The total samples in the dataset: {len(dataset)}")
No NA value contained in the dataset
The total samples in the dataset: 222
Let's have a look at the dataset we have.
# pd.set_option('precision', 3) #This isn't working anymore
pd.set_option('display.float_format', '{:.3f}'.format)
dataset.describe()
| MSFT_pred | GOOGL | IBM | DEXJPUS | DEXUSUK | SP500 | DJIA | VIXCLS | MSFT_DT | MSFT_3DT | ... | MSFT_OBV_LowerTail | MSFT_OBV_UpperTail | MSFT_Log_Return_LowerTail | MSFT_Log_Return_UpperTail | MSFT_MA_short_LowerTail | MSFT_MA_short_UpperTail | MSFT_MA_long_LowerTail | MSFT_MA_long_UpperTail | MSFT_Oscillator_LowerTail | MSFT_Oscillator_UpperTail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | ... | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 | 222.000 |
| mean | 0.004 | 0.005 | 0.006 | 0.001 | 0.000 | 0.004 | 0.004 | -0.005 | 0.005 | 0.014 | ... | -0.125 | 0.121 | -0.193 | 0.156 | -0.182 | 0.276 | -0.174 | 0.172 | -0.204 | 0.247 |
| std | 0.032 | 0.040 | 0.034 | 0.013 | 0.012 | 0.026 | 0.025 | 0.143 | 0.033 | 0.059 | ... | 0.419 | 0.478 | 0.570 | 0.472 | 0.732 | 0.874 | 0.530 | 0.540 | 0.669 | 0.834 |
| min | -0.078 | -0.119 | -0.125 | -0.042 | -0.041 | -0.082 | -0.071 | -0.542 | -0.078 | -0.149 | ... | -2.434 | 0.000 | -2.811 | 0.000 | -7.481 | 0.000 | -3.525 | 0.000 | -5.161 | 0.000 |
| 25% | -0.017 | -0.020 | -0.012 | -0.005 | -0.007 | -0.008 | -0.009 | -0.097 | -0.016 | -0.020 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 50% | 0.004 | 0.007 | 0.006 | 0.001 | -0.001 | 0.004 | 0.003 | -0.010 | 0.006 | 0.018 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 75% | 0.027 | 0.028 | 0.025 | 0.008 | 0.007 | 0.018 | 0.017 | 0.079 | 0.026 | 0.050 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| max | 0.087 | 0.110 | 0.175 | 0.042 | 0.075 | 0.160 | 0.183 | 0.532 | 0.164 | 0.245 | ... | 0.000 | 3.049 | 0.000 | 2.687 | 0.000 | 4.511 | 0.000 | 3.492 | 0.000 | 5.066 |
8 rows × 57 columns
dataset.head()
| MSFT_pred | GOOGL | IBM | DEXJPUS | DEXUSUK | SP500 | DJIA | VIXCLS | MSFT_DT | MSFT_3DT | ... | MSFT_OBV_LowerTail | MSFT_OBV_UpperTail | MSFT_Log_Return_LowerTail | MSFT_Log_Return_UpperTail | MSFT_MA_short_LowerTail | MSFT_MA_short_UpperTail | MSFT_MA_long_LowerTail | MSFT_MA_long_UpperTail | MSFT_Oscillator_LowerTail | MSFT_Oscillator_UpperTail | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-03-30 | 0.031 | 0.084 | 0.175 | -0.030 | 0.075 | 0.160 | 0.183 | -0.076 | 0.164 | 0.062 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | -1.314 | 0.000 | 0.000 | 0.000 |
| 2020-04-06 | 0.050 | 0.032 | 0.017 | 0.010 | -0.008 | 0.014 | 0.016 | -0.232 | 0.031 | 0.199 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2020-04-14 | -0.034 | 0.067 | 0.076 | -0.017 | 0.022 | 0.068 | 0.056 | -0.212 | 0.050 | 0.245 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 2020-04-22 | 0.022 | 0.001 | 0.005 | 0.004 | -0.015 | 0.006 | -0.001 | 0.028 | 0.009 | 0.096 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | -1.140 | 0.000 | 0.000 | 0.000 |
| 2020-04-29 | 0.028 | 0.064 | 0.076 | -0.011 | 0.008 | 0.049 | 0.048 | -0.296 | 0.022 | 0.082 | ... | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | -2.107 | 0.000 | 0.000 | 0.000 |
5 rows × 57 columns
Next, let's look at the distribution of the data over the entire period.
# Number of columns in the dataset
num_columns = dataset.shape[1]
# Calculate grid size
ncols = 6 # Number of columns in the grid
nrows = (num_columns + ncols - 1) // ncols # Calculate the required number of rows
# Create the figure and axes
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 15)) # Adjust figure size
axes = axes.flatten() # Flatten axes for easy iteration
# Plot each column as a histogram
for i, column in enumerate(dataset.columns):
ax = axes[i]
dataset[column].hist(bins=50, ax=ax, alpha=0.75)
ax.set_title(column, fontsize=10)
ax.tick_params(axis='x', rotation=45, labelsize=8)
ax.tick_params(axis='y', labelsize=8)
# Hide extra axes (if any)
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
# Adjust layout to prevent overlaps
plt.tight_layout()
plt.show()
The histograms above show the distribution for each series individually. Next, let's look at the density distribution over the same x-axis scale.
# Number of columns in the dataset
num_columns = dataset.shape[1]
# Calculate grid size
ncols = 6 # Number of columns per row
nrows = (num_columns + ncols - 1) // ncols # Calculate the number of rows
# Create the figure and axes
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 15)) # Adjust figure size for density plots
axes = axes.flatten() # Flatten the axes array for easy iteration
# Plot density for each column
for i, column in enumerate(dataset.columns):
ax = axes[i]
dataset[column].plot(kind='density', ax=ax, alpha=0.75)
ax.set_title(column, fontsize=10)
ax.tick_params(axis='x', rotation=45, labelsize=8)
ax.tick_params(axis='y', labelsize=8)
# Hide unused subplots (if any)
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
# Adjust layout
plt.tight_layout()
plt.show()
In order to get a sense of the interdependence of the data, we look at the scatter plot and the correlation matrix.
# Compute the correlation matrix
correlation = dataset.corr()
# Set up the matplotlib figure
plt.figure(figsize=(25, 20)) # Increase figure size for better readability
# Draw the heatmap with customizations
sns.heatmap(
correlation,
annot=True, # Annotate with correlation coefficients
fmt=".2f", # Format annotations to 2 decimal places
cmap="coolwarm", # Use a visually clear colormap
vmax=1, vmin=-1, # Set color scale limits
square=True, # Make each cell square
linewidths=0.5, # Add space between cells
annot_kws={"size": 8}, # Adjust annotation font size
cbar_kws={"shrink": 0.8}, # Adjust colorbar size
)
# Customize the title and axis labels
plt.title("Correlation Matrix", fontsize=18, pad=20)
plt.xticks(fontsize=10, rotation=45, ha="right") # Rotate x-axis labels
plt.yticks(fontsize=10, rotation=0) # Rotate y-axis labels
# Display the heatmap
plt.tight_layout()
plt.show()
Looking at the correlation plot above, we see some correlation of the predicted variable with the lagged 5-day, 15-day, 30-day and 60-day returns of MSFT.
# Select a subset of columns (optional)
selected_columns = dataset.columns[:10] # For example, select the first 10 columns
data_to_plot = dataset[selected_columns]
# Create the scatter matrix
scatter_matrix(
data_to_plot,
figsize=(18, 18), # Increase figure size for better readability
diagonal='kde', # Use Kernel Density Estimation (KDE) for histograms on the diagonal
marker='o', # Use circles for scatter plots
s=15, # Marker size
alpha=0.7, # Marker transparency
color='blue', # Marker color
hist_kwds={'bins': 20, 'color': 'lightblue'}, # Customize histogram appearance
grid=True, # Add gridlines for better separation
)
# Improve layout
plt.tight_layout()
plt.show()
Looking at the scatter plot above, we see some linear relationship of the predicted variable with the lagged 15-day, 30-day and 60-day returns of MSFT.
Next, we look at the seasonal decomposition of our time series
res = sm.tsa.seasonal_decompose(Y,period=52)
fig = res.plot()
fig.set_figheight(8)
fig.set_figwidth(15)
pyplot.show()
The decomposition highlights periodic seasonality and residual randomness, which can be modeled or accounted for in predictive models. Further statistical tests (e.g., the Augmented Dickey-Fuller test) can confirm the stationarity of the series.
In this section, we continue using the indicators from the seed code. A feature selection method is applied to all the technical indicators to select the most informative features.
There are 4 methods used in this section: SelectPercentile with mutual information, SelectPercentile with the F-statistic, PCA, and random forest feature importance.
# Set the global seed
seed = 7
np.random.seed(7)
# A seeded scorer can be used so that the mutual information estimate is reproducible
scorer = lambda X, y: mutual_info_regression(X, y, random_state=202)
# selector = SelectPercentile(scorer, percentile=10)  # seeded alternative
selector = SelectPercentile(mutual_info_regression, percentile=10)
features_kbest = selector.fit_transform(X[features.columns], Y)
print(f"Original number of features: {features.shape[1]}")
print(f"New number of features: {features_kbest.shape[1]}")
Original number of features: 45
New number of features: 5
arr = np.transpose(selector.get_support())
selected_technical_features = features.columns[arr]
# Select the features that were not selected
unselect_technical_features = features.columns[~selector.get_support()]
print("Index of best features: ", np.asarray(arr == True).nonzero())
print("Best features: ", selected_technical_features)
# Drop the unselect features using the master dataset (dataset)
dataset_filtered = dataset.drop(columns=unselect_technical_features)
X_mutual_info_regression = dataset_filtered.loc[:, dataset_filtered.columns != Y.name]
Y_mutual_info_regression = dataset_filtered.loc[:, dataset_filtered.columns == Y.name]
Index of best features: (array([ 0, 3, 5, 15, 23]),)
Best features: Index(['MSFT_RSI', 'MSFT_MACD_hist', 'MSFT_BB_middle', 'MSFT_RSI_LowerTail',
'MSFT_BB_upper_LowerTail'],
dtype='object')
# SelectPercentile with the F-statistic (f_regression) as the scorer
selector = SelectPercentile(f_regression, percentile=10)
# selector = SelectPercentile(mutual_info_regression, percentile=10)
features_kbest = selector.fit_transform(X[features.columns], Y)
print(f"Original number of features: {features.shape[1]}")
print(f"New number of features: {features_kbest.shape[1]}")
Original number of features: 45
New number of features: 5
arr = np.transpose(selector.get_support())
selected_technical_features = features.columns[arr]
# Select the features that were not selected
unselect_technical_features = features.columns[~selector.get_support()]
print("Index of best features: ", np.asarray(arr == True).nonzero())
print("Best features: ", selected_technical_features)
# Drop the unselect features using the master dataset (dataset)
dataset_filtered = dataset.drop(columns=unselect_technical_features)
X_f_regression = dataset_filtered.loc[:, dataset_filtered.columns != Y.name]
Y_f_regression = dataset_filtered.loc[:, dataset_filtered.columns == Y.name]
Index of best features: (array([ 6, 17, 27, 29, 37]),)
Best features: Index(['MSFT_BB_lower', 'MSFT_MACD_LowerTail', 'MSFT_BB_lower_LowerTail',
'MSFT_ATR_LowerTail', 'MSFT_Log_Return_LowerTail'],
dtype='object')
# Assuming X is your input features dataframe
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit PCA without specifying n_components to get all components
pca_full = PCA()
pca_full.fit(X_scaled)
# Calculate cumulative explained variance
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
# Plot cumulative explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. Number of Components')
plt.grid(True)
plt.show()
The actual implementation of PCA is integrated into the model pipeline (shown in Section 5).
model = RandomForestRegressor(random_state=seed)
model.fit(X[features.columns], Y)
# Feature importance
feature_importances = pd.DataFrame({'feature': features.columns,
'importance': model.feature_importances_})
feature_importances.sort_values(by='importance', ascending=False, inplace=True)
print(feature_importances)
feature importance
12 MSFT_MA_short 0.073
7 MSFT_ATR 0.070
5 MSFT_BB_middle 0.068
11 MSFT_Log_Return 0.063
6 MSFT_BB_lower 0.062
4 MSFT_BB_upper 0.061
13 MSFT_MA_long 0.058
10 MSFT_OBV 0.056
1 MSFT_MACD 0.054
0 MSFT_RSI 0.053
2 MSFT_MACD_signal 0.051
14 MSFT_Oscillator 0.051
9 MSFT_STOCH_D 0.045
3 MSFT_MACD_hist 0.043
8 MSFT_STOCH_K 0.029
29 MSFT_ATR_LowerTail 0.021
24 MSFT_BB_upper_UpperTail 0.014
37 MSFT_Log_Return_LowerTail 0.014
19 MSFT_MACD_signal_LowerTail 0.013
43 MSFT_Oscillator_LowerTail 0.010
22 MSFT_MACD_hist_UpperTail 0.009
25 MSFT_BB_middle_LowerTail 0.006
27 MSFT_BB_lower_LowerTail 0.006
34 MSFT_STOCH_D_UpperTail 0.006
41 MSFT_MA_long_LowerTail 0.005
28 MSFT_BB_lower_UpperTail 0.005
23 MSFT_BB_upper_LowerTail 0.005
26 MSFT_BB_middle_UpperTail 0.005
16 MSFT_RSI_UpperTail 0.004
15 MSFT_RSI_LowerTail 0.004
21 MSFT_MACD_hist_LowerTail 0.004
33 MSFT_STOCH_D_LowerTail 0.004
17 MSFT_MACD_LowerTail 0.004
30 MSFT_ATR_UpperTail 0.003
44 MSFT_Oscillator_UpperTail 0.003
35 MSFT_OBV_LowerTail 0.003
32 MSFT_STOCH_K_UpperTail 0.003
39 MSFT_MA_short_LowerTail 0.003
42 MSFT_MA_long_UpperTail 0.002
31 MSFT_STOCH_K_LowerTail 0.002
20 MSFT_MACD_signal_UpperTail 0.002
18 MSFT_MACD_UpperTail 0.001
36 MSFT_OBV_UpperTail 0.001
38 MSFT_Log_Return_UpperTail 0.001
40 MSFT_MA_short_UpperTail 0.001
# Select the top 5 features accordingly
top_features = feature_importances['feature'].head(5).tolist()
unselect_technical_features = feature_importances['feature'].iloc[5:].tolist()
print("Top Selected Features:", top_features)
print("Unselected Features:", unselect_technical_features)
# Drop the unselected features from the master dataset (dataset)
dataset_filtered = dataset.drop(columns=unselect_technical_features)
X_rf_importance = dataset_filtered.loc[:, dataset_filtered.columns != Y.name]
Y_rf_importance = dataset_filtered.loc[:, dataset_filtered.columns == Y.name]
Top Selected Features: ['MSFT_MA_short', 'MSFT_ATR', 'MSFT_BB_middle', 'MSFT_Log_Return', 'MSFT_BB_lower']
Unselected Features: ['MSFT_BB_upper', 'MSFT_MA_long', 'MSFT_OBV', 'MSFT_MACD', 'MSFT_RSI', 'MSFT_MACD_signal', 'MSFT_Oscillator', 'MSFT_STOCH_D', 'MSFT_MACD_hist', 'MSFT_STOCH_K', 'MSFT_ATR_LowerTail', 'MSFT_BB_upper_UpperTail', 'MSFT_Log_Return_LowerTail', 'MSFT_MACD_signal_LowerTail', 'MSFT_Oscillator_LowerTail', 'MSFT_MACD_hist_UpperTail', 'MSFT_BB_middle_LowerTail', 'MSFT_BB_lower_LowerTail', 'MSFT_STOCH_D_UpperTail', 'MSFT_MA_long_LowerTail', 'MSFT_BB_lower_UpperTail', 'MSFT_BB_upper_LowerTail', 'MSFT_BB_middle_UpperTail', 'MSFT_RSI_UpperTail', 'MSFT_RSI_LowerTail', 'MSFT_MACD_hist_LowerTail', 'MSFT_STOCH_D_LowerTail', 'MSFT_MACD_LowerTail', 'MSFT_ATR_UpperTail', 'MSFT_Oscillator_UpperTail', 'MSFT_OBV_LowerTail', 'MSFT_STOCH_K_UpperTail', 'MSFT_MA_short_LowerTail', 'MSFT_MA_long_UpperTail', 'MSFT_STOCH_K_LowerTail', 'MSFT_MACD_signal_UpperTail', 'MSFT_MACD_UpperTail', 'MSFT_OBV_UpperTail', 'MSFT_Log_Return_UpperTail', 'MSFT_MA_short_UpperTail']
Next, we split our data into training and testing chunks. Because we are going to use time series models, we have to split the data into continuous (chronological) series rather than shuffling.
The helper function train_test_split() prepares the training and testing chunks according to the chosen feature selection method.
# Make this a function for reproducibility
# Note: this deliberately shadows sklearn's train_test_split; it performs a
# chronological (non-shuffled) split suitable for time series data
def train_test_split(feature_selector_method, test_size=0.2):
    validation_size = test_size
    feature_selector = ['f_regression', 'mutual_info_regression', 'pca', 'random_forest_importance']
    feature_selector_method = feature_selector_method.lower()
    x_y_data = [(X_f_regression, Y_f_regression), (X_mutual_info_regression, Y_mutual_info_regression), (X, Y), (X_rf_importance, Y_rf_importance)]
    X_filtered, Y_filtered = x_y_data[feature_selector.index(feature_selector_method)]
    train_size = int(len(X_filtered) * (1 - validation_size))
    X_train, X_test = X_filtered[0:train_size], X_filtered[train_size:]
    Y_train, Y_test = Y_filtered[0:train_size], Y_filtered[train_size:]
    return X_train, X_test, Y_train, Y_test
num_folds = 10
# scikit-learn scorers are maximized, so MSE is exposed as 'neg_mean_squared_error'.
# To report errors on the usual scale and compare models, we negate the scores back to positive.
scoring = 'neg_mean_squared_error'
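The sign convention can be demonstrated on a small synthetic regression problem (a minimal sketch; `X_demo` and `y_demo` are hypothetical stand-ins for the real features and target):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=100)

# cross_val_score returns *negative* MSE, so larger (closer to 0) is better
neg_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             cv=5, scoring='neg_mean_squared_error')
mse_scores = -neg_scores  # flip the sign to get ordinary MSE (lower is better)
print(neg_scores.max() <= 0, mse_scores.min() >= 0)
```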
# Each stochastic model below is given random_state=seed for reproducibility
models = []
models.append(('LR', LinearRegression()))
models.append(('LASSO', Lasso(random_state=seed)))
models.append(('EN', ElasticNet(random_state=seed)))
models.append(('KNN', KNeighborsRegressor()))
models.append(('CART', DecisionTreeRegressor(random_state=seed)))
models.append(('SVR', SVR()))
models.append(('MLP', MLPRegressor(random_state=seed)))
# Boosting methods
models.append(('ABR', AdaBoostRegressor(random_state=seed)))
models.append(('GBR', GradientBoostingRegressor(random_state=seed)))
# Bagging methods
models.append(('RFR', RandomForestRegressor(random_state=seed)))
models.append(('ETR', ExtraTreesRegressor(random_state=seed)))
Once we have selected all the models, we loop over each of them.
For each feature selection method, the loop: wraps the model in a scaling-plus-PCA pipeline when PCA is the selector, evaluates it with k-fold cross-validation on the training set, refits it on the full training set, and scores it on the held-out test set.
Then we find the most robust model by evaluating the cross-validation, training, and test errors together.
feature_selector = ['f_regression', 'mutual_info_regression', 'pca', 'random_forest_importance']
lowest_test_results = {}
kfold_results = {}
train_results = {}
test_results = {}
for feature_selector_method in feature_selector:
print("\n" + feature_selector_method)
X_train, X_test, Y_train, Y_test = train_test_split(feature_selector_method)
names = []
feature_selector_kfold_results = {}
feature_selector_test_results = {}
feature_selector_train_results = {}
for name, model in models:
names.append(name)
if feature_selector_method == 'pca':
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=0.85, random_state=seed)),
('model', model)
])
model = pipeline
# kfold = KFold(n_splits=num_folds, random_state=seed)
kfold = KFold(n_splits=num_folds, shuffle=False)
        # convert the (negated) mean squared error back to positive; the lower the better
cv_results = -1* cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
feature_selector_kfold_results[name] = (cv_results)
# Full Training period
res = model.fit(X_train, Y_train)
train_result = mean_squared_error(res.predict(X_train), Y_train)
feature_selector_train_results[name] = train_result
# Test results
test_result = mean_squared_error(res.predict(X_test), Y_test)
feature_selector_test_results[name] = test_result
msg = "%s: %f (%f) %f %f" % (name, cv_results.mean(), cv_results.std(), train_result, test_result)
print(msg)
# After finish all the models store the result
kfold_results[feature_selector_method] = feature_selector_kfold_results
train_results[feature_selector_method] = feature_selector_train_results
test_results[feature_selector_method] = feature_selector_test_results
print(f"Lowest Test Results: {names[np.argmin(np.array(list(feature_selector_test_results.values())))]}, {np.min(np.array(list(feature_selector_test_results.values())))}")
lowest_test_results[feature_selector_method] = (
# Model name
names[np.argmin(np.array(list(feature_selector_test_results.values())))],
# Training Error
train_results[feature_selector_method][names[np.argmin(np.array(list(feature_selector_test_results.values())))]],
# Testing Error
np.min(np.array(list(feature_selector_test_results.values()))))
# # Find the lowest test results
# selector = list(lowest_test_results.keys())[np.argmin(np.array([i[1] for i in lowest_test_results.values()]))]
# print(f"\nLowest Test Results: ({selector}, {lowest_test_results[selector]})")
f_regression
LR: 0.001277 (0.000473) 0.001041 0.000638
LASSO: 0.001126 (0.000497) 0.001126 0.000738
EN: 0.001126 (0.000497) 0.001126 0.000738
KNN: 0.001408 (0.000617) 0.000935 0.000880
CART: 0.002208 (0.000962) 0.000000 0.002294
SVR: 0.001121 (0.000491) 0.001127 0.000736
MLP: 0.004008 (0.002346) 0.002357 0.002482
ABR: 0.001360 (0.000633) 0.000604 0.000937
GBR: 0.001513 (0.000701) 0.000110 0.001084
RFR: 0.001285 (0.000556) 0.000173 0.000797
ETR: 0.001295 (0.000597) 0.000000 0.000847
Lowest Test Results: LR, 0.0006375943937565235

mutual_info_regression
LR: 0.001250 (0.000454) 0.001053 0.000693
LASSO: 0.001126 (0.000497) 0.001126 0.000738
EN: 0.001126 (0.000497) 0.001126 0.000738
KNN: 0.001408 (0.000541) 0.000985 0.000695
CART: 0.002820 (0.001028) 0.000000 0.001399
SVR: 0.001121 (0.000491) 0.001127 0.000736
MLP: 0.003345 (0.002330) 0.002250 0.003324
ABR: 0.001333 (0.000599) 0.000536 0.000835
GBR: 0.001625 (0.000715) 0.000097 0.001071
RFR: 0.001295 (0.000603) 0.000169 0.000818
ETR: 0.001286 (0.000615) 0.000000 0.000697
Lowest Test Results: LR, 0.0006932589371241327

pca
LR: 0.001249 (0.000499) 0.001007 0.000675
LASSO: 0.001126 (0.000497) 0.001126 0.000738
EN: 0.001126 (0.000497) 0.001126 0.000738
KNN: 0.001316 (0.000592) 0.000904 0.000783
CART: 0.002329 (0.000521) 0.000000 0.001995
SVR: 0.001121 (0.000491) 0.001127 0.000736
MLP: 0.054469 (0.029901) 0.005670 0.055252
ABR: 0.001231 (0.000536) 0.000487 0.000640
GBR: 0.001356 (0.000597) 0.000069 0.000771
RFR: 0.001211 (0.000521) 0.000172 0.000616
ETR: 0.001300 (0.000533) 0.000000 0.000639
Lowest Test Results: RFR, 0.0006162323521441384

random_forest_importance
LR: 0.001277 (0.000473) 0.001041 0.000638
LASSO: 0.001126 (0.000497) 0.001126 0.000738
EN: 0.001126 (0.000497) 0.001126 0.000738
KNN: 0.001408 (0.000617) 0.000935 0.000880
CART: 0.002208 (0.000962) 0.000000 0.002294
SVR: 0.001121 (0.000491) 0.001127 0.000736
MLP: 0.004008 (0.002346) 0.002357 0.002482
ABR: 0.001360 (0.000633) 0.000604 0.000937
GBR: 0.001513 (0.000701) 0.000110 0.001084
RFR: 0.001285 (0.000556) 0.000173 0.000797
ETR: 0.001295 (0.000597) 0.000000 0.000847
Lowest Test Results: LR, 0.0006375943937565235
# Find the best model, accounting for both the test error and overfitting
def overfitting_ratio(training_error, testing_error):
    # Relative gap between train and test error
    return np.abs(training_error - testing_error) / training_error

def balance_testerror_overfitting(w1=0.8, w2=0.2, test_error=None, overfit_ratio=None):
    # Weighted score combining test error and the overfitting penalty; lower is better
    penalty = overfit_ratio
    model_score = w1 * test_error + w2 * penalty
    return model_score

best_model_score = np.inf
best_balance_model = [0, 0]
for (selector, model) in lowest_test_results.items():
    model_name, model_train_error, model_test_error = model
    overfit_ratio = overfitting_ratio(model_train_error, model_test_error)
    model_score = balance_testerror_overfitting(test_error=model_test_error, overfit_ratio=overfit_ratio)
    if model_score < best_model_score:
        best_model_score = model_score
        best_balance_model[0], best_balance_model[1] = selector, model_name
print(f"The best balanced model: {best_balance_model}")
The best balanced model: ['mutual_info_regression', 'LR']
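The overfitting ratio defined above is easy to sanity-check on toy numbers (the error values below are made up for illustration only):

```python
import numpy as np

def overfitting_ratio(training_error, testing_error):
    # Relative gap between train and test error, as defined above
    return np.abs(training_error - testing_error) / training_error

# Hypothetical errors: a well-generalizing model vs. an overfit one
balanced = overfitting_ratio(0.0010, 0.0007)  # small train/test gap -> 0.3
overfit = overfitting_ratio(0.0002, 0.0008)   # test error 4x train error -> 3.0

print(balanced, overfit)
```

The ratio penalizes models whose test error drifts far from their train error, regardless of direction.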
Based on these results, we conclude that the best model uses "PCA" for feature selection together with the "ABR" model.
We begin by looking at the K-Fold results when using PCA as the feature selector.
# Clip the y-axis
fig = plt.figure(figsize=(15, 8))
fig.suptitle('Algorithm Comparison: K-Fold Results (Clipped Y-Axis)', fontsize=16)
ax = fig.add_subplot(111)
ax.boxplot(list(kfold_results['pca'].values()), boxprops=dict(color='blue', linewidth=2), medianprops=dict(color='orange', linewidth=2))
# Set y-axis limits
ax.set_ylim([0, 0.01]) # Adjust the range to focus on smaller values
ax.set_xticklabels(names, fontsize=12, rotation=45, ha='right')
ax.set_ylabel('Performance Metric', fontsize=14)
ax.yaxis.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
The graph is clipped at an error of 0.01 since we are only interested in models with relatively low error.
From the graph we can conclude that, apart from "MLP", all other models perform relatively well.
Using PCA as the feature selector with the "ABR" model gives the best result. Therefore, we will plot the training and testing errors using PCA as the feature selector and compare across all machine learning models.
# Set up x locations and bar width
ind = np.arange(len(names)) # x locations
width = 0.35 # width of bars
# Create the plot
fig = plt.figure(figsize=(15, 8))
fig.suptitle('Algorithm Comparison (Clipped Y-Axis)', fontsize=16)
ax = fig.add_subplot(111)
# Bar plots for train and test errors
ax.bar(ind - width / 2, list(train_results['pca'].values()), width=width, label='Train Error')
ax.bar(ind + width / 2, list(test_results['pca'].values()), width=width, label='Test Error')
# Set x-axis labels and legend
ax.set_xticks(ind)
ax.set_xticklabels(names, fontsize=12, rotation=45, ha='right')
ax.legend(fontsize=12)
# Clip the y-axis
ax.set_ylim([0, 0.01]) # Focus on smaller values, clipping MLP's large error
# Add grid lines
ax.yaxis.grid(True, linestyle='--', alpha=0.7)
# Improve layout
plt.tight_layout()
plt.show()
The training error is also clipped at 0.01, for the same reason: we are only interested in the models with lower error.
Looking at the training and test errors, we still see better performance from the linear models. Some of the algorithms, such as the decision tree regressor (CART), overfit the training data and produced very high error on the test set; these models should be avoided. Ensemble models, such as gradient boosting regression (GBR) and random forest regression (RFR), have low bias but high variance. We also see that the artificial neural network (shown as MLP in the chart) shows higher errors on both the training and test sets, perhaps because the linear relationships among the variables are not captured accurately by the ANN, or due to improper hyperparameters or insufficient training of the model.
For the ARIMA model we use auto_arima to search the parameter space automatically. In the ARIMA pipeline, we also prepare the dataset so that only the correlated variables are used as exogenous variables.
# auto_arima comes from pmdarima; mean_squared_error from scikit-learn
from pmdarima import auto_arima
from sklearn.metrics import mean_squared_error

error_Training_ARIMA = {}
error_Test_ARIMA = {}
feature_selector = ['f_regression', 'mutual_info_regression', 'pca', 'random_forest_importance']
for feature_selector_method in feature_selector:
    print("\n" + feature_selector_method)
    X_train, X_test, Y_train, Y_test = train_test_split(feature_selector_method)
    correlation_matrix = X_train.corrwith(Y_train)
    # Sort features by correlation (descending) and keep the top 8
    selected_features = correlation_matrix.sort_values(ascending=False)[:8]
    selected_features = selected_features.index.tolist()
    print("Top Selected Features:", selected_features)
    # Find the column indices of the selected features
    selected_indices = [list(X_train.columns).index(feature) for feature in selected_features]
    X_train_ARIMA = X_train.iloc[:, selected_indices]
    X_test_ARIMA = X_test.iloc[:, selected_indices]
    # Use auto_arima to find the best parameters
    model_fit = auto_arima(
        Y_train,
        X=X_train_ARIMA,        # exogenous regressors
        start_p=0,
        start_q=0,
        max_p=5,                # wider range for AR terms
        max_q=5,                # wider range for MA terms
        seasonal=True,          # enable seasonal components
        m=52,                   # seasonal period (weekly data, yearly seasonality)
        start_P=0,              # start for seasonal AR terms
        start_Q=0,              # start for seasonal MA terms
        max_P=3,                # max seasonal AR terms
        max_Q=3,                # max seasonal MA terms
        stepwise=True,          # stepwise search
        trace=True,             # display progress
        error_action="ignore",
        suppress_warnings=True,
        random_state=42
    )
    train_predictions = model_fit.predict_in_sample(X=X_train_ARIMA, start=0, end=len(Y_train) - 1)
    error_Training_ARIMA[feature_selector_method] = mean_squared_error(Y_train, train_predictions)
    predicted = model_fit.predict(n_periods=len(X_test_ARIMA), X=X_test_ARIMA)
    error_Test_ARIMA[feature_selector_method] = mean_squared_error(Y_test, predicted)
    print(f"ARIMA Train Error: {error_Training_ARIMA[feature_selector_method]}")
    print(f"ARIMA Test Error: {error_Test_ARIMA[feature_selector_method]}")
f_regression
Top Selected Features: ['DEXJPUS', 'DEXUSUK', 'DJIA', 'GOOGL', 'IBM', 'MSFT_12DT', 'MSFT_3DT', 'MSFT_6DT']
Performing stepwise search to minimize aic
ARIMA(0,0,0)(0,0,0)[52] intercept : AIC=-695.286
ARIMA(1,0,0)(1,0,0)[52] intercept : AIC=-691.705
ARIMA(0,0,1)(0,0,1)[52] intercept : AIC=-691.695
ARIMA(0,0,0)(0,0,0)[52]           : AIC=-693.327
ARIMA(0,0,0)(1,0,0)[52] intercept : AIC=-693.372
ARIMA(0,0,0)(0,0,1)[52] intercept : AIC=-693.380
ARIMA(0,0,0)(1,0,1)[52] intercept : AIC=-691.362
ARIMA(1,0,0)(0,0,0)[52] intercept : AIC=-693.635
ARIMA(0,0,1)(0,0,0)[52] intercept : AIC=-693.619
ARIMA(1,0,1)(0,0,0)[52] intercept : AIC=-691.626
Best model: ARIMA(0,0,0)(0,0,0)[52] intercept
ARIMA Train Error: 0.0011264949054473111
ARIMA Test Error: 0.0007381617908776561

mutual_info_regression and random_forest_importance select the same features as f_regression and produce an identical stepwise search, best model, and errors.

pca
Top Selected Features: ['MSFT_MA_long_UpperTail', 'MSFT_ATR_LowerTail', 'MSFT_OBV_LowerTail', 'MSFT_RSI_LowerTail', 'MSFT_MACD_hist', 'MSFT_MACD_LowerTail', 'MSFT_RSI_UpperTail', 'VIXCLS']
Best model: ARIMA(0,0,0)(0,0,0)[52] intercept
ARIMA Train Error: 0.0011264949054473111
ARIMA Test Error: 0.0007381617908776561
# Pick the selector with the lowest ARIMA test error
min_error = np.inf
best_arima_selector = ""
for selector, test_error in error_Test_ARIMA.items():
    if test_error < min_error:
        min_error = test_error
        best_arima_selector = selector
print(f"Best ARIMA selector: {best_arima_selector}\nTest Error: {min_error}")
Best ARIMA selector: random_forest_importance Test Error: 0.0007381617908776561
seq_len = 2  # Length of the sequence for the LSTM

def generate_LSTM_data(feature_selector):
    X_train, X_test, Y_train, Y_test = train_test_split(feature_selector)
    total_X = pd.concat([X_train, X_test])
    Y_train_LSTM, Y_test_LSTM = np.array(Y_train)[seq_len - 1:], np.array(Y_test)
    X_train_LSTM = np.zeros((X_train.shape[0] + 1 - seq_len, seq_len, X_train.shape[1]))
    X_test_LSTM = np.zeros((X_test.shape[0], seq_len, total_X.shape[1]))
    for i in range(seq_len):
        X_train_LSTM[:, i, :] = np.array(X_train)[i:X_train.shape[0] + i + 1 - seq_len, :]
        X_test_LSTM[:, i, :] = np.array(total_X)[X_train.shape[0] + i - 1:total_X.shape[0] + i + 1 - seq_len, :]
    return X_train_LSTM, X_test_LSTM, Y_train_LSTM, Y_test_LSTM
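The windowing logic inside generate_LSTM_data is easiest to verify on a tiny toy array. The sketch below (made-up data, seq_len=2 as above) shows how each training sample becomes a sequence of seq_len consecutive rows:

```python
import numpy as np

seq_len = 2
X = np.arange(10).reshape(5, 2)  # 5 time steps, 2 features

# Build overlapping windows: sample t holds rows [t, t + seq_len)
n_samples = X.shape[0] + 1 - seq_len
X_seq = np.zeros((n_samples, seq_len, X.shape[1]))
for i in range(seq_len):
    X_seq[:, i, :] = X[i:n_samples + i, :]

print(X_seq.shape)  # (4, 2, 2): 4 samples, each a sequence of 2 rows
print(X_seq[0])     # rows 0 and 1 of X
```

Note that the targets are shifted by seq_len - 1 in the same way, so each window is aligned with the return it is meant to predict.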
# Early stopping: monitor validation loss and restore the best weights seen during training
from tensorflow.keras.callbacks import EarlyStopping
import tensorflow as tf
from tensorflow.keras.regularizers import l2
tf.random.set_seed(seed)  # `seed` is set earlier in the notebook
feature_selector = ['f_regression', 'mutual_info_regression', 'pca', 'random_forest_importance']
LSTM_train_error = {}
LSTM_test_error = {}
LSTM_models = {}
# Define the EarlyStopping callback
early_stopping = EarlyStopping(
    monitor='val_loss',         # metric to monitor (validation loss)
    patience=50,                # number of epochs with no improvement to wait
    restore_best_weights=True   # restore weights from the best epoch
)

for selector in feature_selector:
    X_train_LSTM, X_test_LSTM, Y_train_LSTM, Y_test_LSTM = generate_LSTM_data(selector)

    # LSTM network
    def create_LSTMmodel():
        model = Sequential()
        model.add(LSTM(50, input_shape=(X_train_LSTM.shape[1], X_train_LSTM.shape[2])))
        # More cells can be added if needed
        # L2 regularizer on the output layer
        model.add(Dense(1, kernel_regularizer=l2(0.01)))
        model.compile(loss='mse', optimizer='adam')
        return model

    LSTMModel = create_LSTMmodel()
    LSTMModel_fit = LSTMModel.fit(
        X_train_LSTM, Y_train_LSTM,
        validation_data=(X_test_LSTM, Y_test_LSTM),
        epochs=300,                  # maximum number of epochs
        batch_size=44,
        callbacks=[early_stopping],  # include EarlyStopping in callbacks
        verbose=0,
        shuffle=False
    )
    # Store the model
    LSTM_models[selector] = LSTMModel

    # Visual check that the error is decreasing
    pyplot.plot(LSTMModel_fit.history['loss'], label='train')
    pyplot.plot(LSTMModel_fit.history['val_loss'], label='test')
    pyplot.title(f"Selector: {selector} Train and Validation Loss")
    pyplot.legend()
    pyplot.show()

    # Training and testing error
    error_Training_LSTM = mean_squared_error(Y_train_LSTM, LSTMModel.predict(X_train_LSTM))
    predicted = LSTMModel.predict(X_test_LSTM)
    error_Test_LSTM = mean_squared_error(Y_test_LSTM, predicted)
    LSTM_train_error[selector] = error_Training_LSTM
    LSTM_test_error[selector] = error_Test_LSTM
# Find the best model
min_error = np.inf
best_selector = ""
for selector, test_error in LSTM_test_error.items():
    if test_error < min_error:
        min_error = test_error
        best_selector = selector
print(f"Best LSTM Model:\nSelector: {best_selector}\nTrain Error: {LSTM_train_error[best_selector]}\nTest Error: {LSTM_test_error[best_selector]}")
Best LSTM Model: Selector: mutual_info_regression Train Error: 0.0010033489083806039 Test Error: 0.0006990970866481242
lowest_test_results['pca']
('RFR', np.float64(0.00017176739182642764), np.float64(0.0006162323521441384))
# Compare the best model from each method:
# Machine Learning (PCA), ARIMA (random_forest_importance), LSTM (mutual_info_regression)
models = ['Machine Learning (ABR-PCA)', 'ARIMA (random_forest_importance)', 'LSTM (mutual_info_regression)']
train_errors = [lowest_test_results['pca'][1], error_Training_ARIMA['random_forest_importance'], LSTM_train_error['mutual_info_regression']]
test_errors = [lowest_test_results['pca'][2], error_Test_ARIMA['random_forest_importance'], LSTM_test_error['mutual_info_regression']]
x = range(len(models))
# Create a bar plot
plt.figure(figsize=(10, 6))
plt.bar(x, train_errors, width=0.4, label='Train Error', align='center')
plt.bar(x, test_errors, width=0.4, label='Test Error', align='edge')
# Add labels, title, and legend
plt.xticks(x, models, rotation=15)
plt.ylabel('Error')
plt.title('Comparison of Train and Test Errors Across Models')
plt.legend()
# Show the plot
plt.tight_layout()
plt.show()
min_index = np.argmin(np.array(test_errors))
print(f"Best Results: {models[min_index]} - Validation Error: {test_errors[min_index]}")
Best Results: Machine Learning (ABR-PCA) - Validation Error: 0.0006162323521441384
Looking at the chart above, we find that all three models perform very well. Judging by the validation error, using PCA as the feature selector with ABR as the machine learning model leads to the lowest validation error.
In the next section, we will tune this model to find the optimal hyperparameters.
We will use GridSearch to search the joint parameter space of PCA and ABR, which allows us to find the optimal parameters for each.
X_train, X_test, Y_train, Y_test = train_test_split('pca')
# Use grid search to search on ABR using PCA feature selector
# Define the pipeline
pipeline = Pipeline([
('scaler', StandardScaler()), # Standardize the data
('pca', PCA()), # PCA for dimensionality reduction
('abr', AdaBoostRegressor(random_state=42)) # AdaBoost Regressor
])
# Define the parameter grid for AdaBoost Regressor
param_grid = {
'pca__n_components': [0.85, 0.9, 0.95], # Variance explained thresholds
'abr__n_estimators': [50, 100, 200], # Number of weak learners (trees)
'abr__learning_rate': [0.01, 0.1, 1.0], # Learning rate
'abr__loss': ['linear', 'square', 'exponential'] # Loss functions
}
# Create GridSearchCV
grid_search = GridSearchCV(
pipeline,
param_grid,
scoring='neg_mean_squared_error', # Use MSE as the scoring metric
cv=5, # 5-fold cross-validation
n_jobs=-1, # Use all CPUs
verbose=2
)
# Fit the grid search
grid_search.fit(X_train, Y_train)
# Get the best parameters and model
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
Fitting 5 folds for each of 81 candidates, totalling 405 fits
Best Parameters: {'abr__learning_rate': 0.01, 'abr__loss': 'linear', 'abr__n_estimators': 200, 'pca__n_components': 0.85}
X_train, X_test, Y_train, Y_test = train_test_split('pca')
best_model = grid_search.best_estimator_
# Evaluate the training set
y_train_pred = best_model.predict(X_train)
train_mse = mean_squared_error(Y_train, y_train_pred)
print("Train MSE: ", train_mse)
# Evaluate on the test set
Y_pred = best_model.predict(X_test)
test_mse = mean_squared_error(Y_test, Y_pred)
# Print results
print("Test MSE:", test_mse)
Train MSE: 0.0007672865319608793 Test MSE: 0.0006473180544016745
prediction = best_model.predict(X_test)
actual = Y_test
x_range = Y_test.index
plt.plot(x_range, prediction, label="Predicted")
plt.plot(x_range, actual, label="Actual")
plt.legend()
plt.grid(True)
plt.show()
# Feature importances of the ABR step; because ABR is fitted on the PCA
# output, these importances refer to principal components, not the original columns
abr = best_model.named_steps['abr']
feature_importances = abr.feature_importances_
feature_names = [f"PC{i + 1}" for i in range(len(feature_importances))]
# Sort the feature importances in descending order
sorted_indices = np.argsort(feature_importances)[::-1]
sorted_feature_importances = feature_importances[sorted_indices]
sorted_feature_names = [feature_names[i] for i in sorted_indices]
# Plot the feature importance
plt.figure(figsize=(10, 6))
plt.barh(sorted_feature_names, sorted_feature_importances)
plt.xlabel("Feature Importance")
plt.ylabel("Feature Names")
plt.title("Feature Importance for AdaBoost Regressor")
plt.gca().invert_yaxis() # Invert y-axis to show highest importance at the top
plt.show()
# Financial metrics functions
def calculate_profit_factor(returns):
    # Gross profits divided by gross losses
    profits = np.where(returns >= 0, returns, 0)
    losses = np.where(returns < 0, returns, 0)
    return np.sum(profits) / np.sum(np.abs(losses))

def calculate_cagr(returns):
    cum_return = np.cumprod(returns + 1) - 1
    # Annualize over len(returns) periods, 52 weeks per year
    cagr = (1 + cum_return[-1]) ** (52 / len(cum_return)) - 1
    return cagr

def calculate_sharpe_ratio(returns):
    # Annualized mean/std of weekly returns (no risk-free adjustment)
    return np.sqrt(52) * np.mean(returns) / np.std(returns)

# Evaluate the best model with financial metrics
def evaluate_trading_metrics(predictions, actual_returns):
    returns = actual_returns * predictions  # log returns of the strategy
    returns = np.exp(returns) - 1           # convert to percentage returns
    profit_factor = calculate_profit_factor(returns)
    cagr = calculate_cagr(returns)
    sharpe_ratio = calculate_sharpe_ratio(returns)
    return profit_factor, cagr, sharpe_ratio
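The three metric definitions above can be checked by hand on a toy weekly return series (the numbers below are made up; the factor 52 annualizes weekly data):

```python
import numpy as np

returns = np.array([0.02, -0.01, 0.03, -0.02])  # hypothetical weekly pct returns

# Profit factor: gross profits / gross losses = 0.05 / 0.03
profits = returns[returns >= 0].sum()
losses = np.abs(returns[returns < 0]).sum()
profit_factor = profits / losses

# CAGR: annualize the cumulative return over len(returns) weeks
cum_return = np.cumprod(returns + 1) - 1
cagr = (1 + cum_return[-1]) ** (52 / len(returns)) - 1

# Sharpe: annualized mean/std (no risk-free adjustment, as above)
sharpe = np.sqrt(52) * returns.mean() / returns.std()

print(round(profit_factor, 3))  # 1.667
```

With a positive cumulative return over only four weeks, the annualized CAGR compounds to a large figure, which is why short evaluation windows should be interpreted with care.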
# Evaluate the tuned model's trading performance on the test set
pred = best_model.predict(X_test)
actual_returns = Y_test.values
predictions = np.where(pred > 0, 1, -1)
# Calculate metrics
profit_factor, cagr, sharpe_ratio = evaluate_trading_metrics(predictions, actual_returns)
print(f"Profit Factor: {profit_factor:.2f}, CAGR: {cagr:.2%}, Sharpe Ratio: {sharpe_ratio:.2f}")
Profit Factor: 2.05, CAGR: 47.01%, Sharpe Ratio: 2.15
def detrendPrice(series):
    # Fit a linear model of price against time
    length = len(series)
    x = np.arange(length)
    y = np.array(series.values)
    x_const = sm.add_constant(x)  # add the intercept constant
    model = sm.OLS(y, x_const)
    result = model.fit()
    df = pd.DataFrame(result.params * x_const)
    y_hat = df[0] + df[1]
    # The residuals are the detrended prices
    resid = y - y_hat
    # Shift the residuals up to avoid negative detrended prices
    resid = resid + abs(resid.min() + 1 / 10 * resid.min())
    return resid
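A quick check of the detrending idea on synthetic data: after subtracting the fitted line, the residuals should no longer correlate with time. The sketch below uses numpy's polyfit instead of statsmodels, which is equivalent for a simple linear fit (the price series here is made up):

```python
import numpy as np

# Synthetic price series: linear trend plus noise
rng = np.random.default_rng(0)
x = np.arange(200)
prices = 100 + 0.5 * x + rng.normal(0, 1, 200)

# Fit a line and subtract it -- the same idea as detrendPrice above
slope, intercept = np.polyfit(x, prices, 1)
resid = prices - (slope * x + intercept)

# Shift residuals up so the detrended "prices" stay positive
resid = resid + abs(resid.min() + 0.1 * resid.min())

# The residuals carry essentially no remaining trend
print(abs(np.polyfit(x, resid, 1)[0]) < 1e-6)  # True
```

OLS residuals are orthogonal to the regressor by construction, so the refitted slope is zero up to floating-point noise, and the constant shift does not change it.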
def bootstrap(ser):
    ser.dropna(inplace=True)
    # ser = np.log(ser+1)
    arr = np.array(ser.values)
    alpha = .05 * 100   # significance level alpha (in percent)
    reps = 5000         # number of bootstrap replications (50000 limit if you have 8GB RAM)
    percentile = 100 - alpha
    ave = np.average(arr)  # arithmetic mean
    print("average return %f" % ave)
    centered_arr = arr - ave
    n = len(centered_arr)
    # Construct `reps` alternative return histories and compute their theoretical averages
    xb = np.random.choice(centered_arr, (n, reps), replace=True)
    mb = xb.mean(axis=0)  # arithmetic means
    # Sort the bootstrap averages
    mb.sort()
    # 95% confidence interval (two tails) threshold for the bootstrap averages
    print(np.percentile(mb, [2.5, 97.5]))
    threshold = np.percentile(mb, [percentile])[0]
    if ave > threshold:
        print("Reject Ho = The population distribution of rule returns has an expected value of zero or less (because p_value is small enough)")
    else:
        print("Do not reject Ho = The population distribution of rule returns has an expected value of zero or less (because p_value is not small enough)")
    # count_vals counts the bootstrap averages up to the first one larger than ave
    count_vals = 0
    for i in mb:
        count_vals += 1
        if i > ave:
            break
    # p is based on the count of averages larger than ave, so 1 - count is needed
    p = 1 - count_vals / len(mb)
    print("p_value:")
    print(p)
    # Histogram of the bootstrap averages
    sr = pd.Series(mb)
    desc = sr.describe()
    count = desc[0]
    std = desc[2]
    minim = desc[3]
    maxim = desc[7]
    R = maxim - minim
    # Scott-style rule for the number of bins
    bins = int(round(R * (count ** (1 / 3)) / (3.49 * std), 0))
    fig = sr.hist(bins=bins)
    plt.axvline(x=ave, color='b', label='observed average return')
    # plt.show()
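The core of the reality check above is a centered bootstrap of the mean: center the returns so the null of zero expected return holds, resample many alternative histories, and ask how often a zero-mean process produces an average as large as the observed one. A minimal, self-contained sketch on made-up returns with a seeded RNG:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy rule returns with a genuinely positive mean
arr = rng.normal(0.005, 0.01, 250)

ave = arr.mean()
centered = arr - ave  # impose the null: zero expected return
reps = 5000
# Each column is one resampled return history under the null
xb = rng.choice(centered, (len(arr), reps), replace=True)
mb = xb.mean(axis=0)

# One-sided p-value: fraction of null averages exceeding the observed mean
p = (mb > ave).mean()
print(p < 0.05)  # True: the observed mean is far outside the null distribution
```

Because the toy mean is several standard errors above zero, essentially no bootstrap average exceeds it and the null is rejected.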
# White Reality Check
predict = best_model.predict(X_test)
positions2 = np.where(predict > 0, 1, -1)
detrended_retFut1 = detrendPrice(Y_test)
# Convert positions2 into a Series with the same index as detrended_retFut1
positions_series = pd.Series(data=positions2, index=detrended_retFut1.index)
# Broadcast positions2 to all columns of detrended_retFut1 (row-wise multiplication)
detrended_syst_rets = detrended_retFut1.mul(positions_series, axis=0)
bootstrap(detrended_syst_rets)
average return 0.034005
[-0.01458913  0.01460803]
Reject Ho = The population distribution of rule returns has an expected value of zero or less (because p_value is small enough)
p_value:
0.0
# Monte Carlo Permutation
num_permutations = 1000
permutation_means = np.zeros(num_permutations)
ori_predictions = best_model.predict(X_test)
position = np.where(ori_predictions > 0, 1, -1)
ori_return = position * Y_test
ori_mean_return = ori_return.mean()
count = 0
for i in range(num_permutations):
    permute = np.random.permutation(X_test)
    predictions = best_model.predict(permute)
    positions = np.where(predictions > 0, 1, -1)
    permutation_return = positions * Y_test
    permutation_means[i] = permutation_return.mean()
    if permutation_means[i] > ori_mean_return:
        count += 1
p_val = (count + 1) / (num_permutations + 1)
print(f"Monte Carlo Permutation p-value: {p_val}")
Monte Carlo Permutation p-value: 0.03796203796203796
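The permutation p-value above uses the (count + 1) / (n + 1) correction, which keeps the estimate strictly positive even when no permutation beats the original statistic. A toy illustration with made-up statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
observed = 5.0                      # hypothetical original mean return (standardized)
perm_stats = rng.normal(0, 1, 999)  # means obtained under shuffled inputs

count = (perm_stats > observed).sum()
p_val = (count + 1) / (len(perm_stats) + 1)

# Even if count == 0, the p-value is 1/1000, never exactly 0
print(p_val)
```

Reporting a p-value of exactly zero would overstate the evidence; the correction reflects that the observed statistic is itself one member of the permutation distribution.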
# Train set
# Make "predictions" on the training set (in-sample)
positions = np.where(best_model.predict(X_train) > 0, 1, -1)  # POSITIONS
# dailyRet = pd.Series(positions).shift(1).fillna(0).values * x_train.ret1  # for trading at the close
syst_return = pd.Series(positions).fillna(0).values * Y_train  # for trading right after the open
syst_return = syst_return.fillna(0)
cumret = np.cumsum(syst_return)
plt.figure(1)
plt.plot(cumret.index, cumret)
plt.title('Best model (ABR with PCA) on MSFT: train set')
plt.ylabel('Cumulative Returns')
plt.xlabel('Date')
plt.show()
# Test set
# Make predictions on the test set (out-of-sample)
positions = np.where(best_model.predict(X_test) > 0, 1, -1)  # POSITIONS
syst_return = pd.Series(positions).fillna(0).values * Y_test  # for trading right after the open
syst_return = syst_return.fillna(0)
cumret = np.cumsum(syst_return)
plt.figure(1)
plt.plot(cumret.index, cumret)
plt.title('Best model (ABR with PCA) on MSFT: test set')
plt.ylabel('Cumulative Returns')
plt.xlabel('Date')
plt.show()
# Import the required methods from pickle
from pickle import dump, load
# Specify the filename
filename = 'best_model.sav'
# Save the model to disk
dump(best_model, open(filename, 'wb'))
print(f"Model saved as {filename}")
Model saved as best_model.sav
# Load the saved model from disk
best_model = load(open('best_model.sav', 'rb'))
# Verify the loaded model
print(f"Model loaded: {best_model}")
Model loaded: Pipeline(steps=[('scaler', StandardScaler()), ('pca', PCA(n_components=0.85)),
('abr',
AdaBoostRegressor(learning_rate=0.01, n_estimators=200,
random_state=42))])
We can conclude that introducing technical indicators does indeed lead to slightly better results compared with the seed code. Finding the best-fitting technical indicators is a time-consuming process; by using different feature selection methods, we are able to pick the most informative features with respect to the target.
In this case study, PCA, mutual information regression, f_regression, and the random forest importance method are used to select the top technical indicators, which are then applied to both the machine learning and deep learning models.
Lastly, implementing the financial metrics (Sharpe Ratio, Profit Factor, CAGR) gives us insight into how the trading system performs as a whole. Testing the trading system with the White Reality Check and a Monte Carlo Permutation test also gives us confidence that its performance is not due to chance: the model does learn patterns from the dataset and generates promising predictions.